Hi, and welcome to the intro to search tutorial! This is gonna be fun. Let's start.
We won't dwell on this much, but those studying how to search stuff with computers call what they study information retrieval. The term originated in 1950, in a paper by Calvin Mooers about a punch-card-based system, and has been used in the study of search ever since. (This is about the same time computers started being used for searching.)
We'll introduce some of the standard terminology of information retrieval so you're familiar with it, but don't get too hung up on it. (Here's the first term! Information Retrieval or IR for short. More on Wikipedia ;)
Documents are the building blocks of any search engine. Documents are what you're searching. They can be documentation pages, books, chapters of books, or other semi-structured text. The goal of search is to allow one to run queries over the documents to find what you're looking for. There can be multiple pieces of data associated with each document, and those pieces of data can have different formats, text and datetimes for example. The definition of the different pieces of data that make up a document is called a schema.
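For instance, here's what a schema might look like in whoosh, the library we'll use throughout this tutorial. (This is just an illustrative sketch; the field names are made up.)
In [ ]:
# A sketch of a schema mixing data formats (the field names are invented).
from whoosh.fields import Schema, ID, TEXT, DATETIME

blog_schema = Schema(url=ID(unique=True),    # a unique identifier per document
                     body=TEXT,              # full-text searchable content
                     published=DATETIME)     # a non-text piece of data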
There are two main problems to solve when it comes to search: how to index documents, and how to query them. First, we'll talk about indexing.
Sometimes the data we're searching is small enough that we can just scan the full data for matches every time. In that case, we can just use a regular expression. But we're not talking about regexes today. We're talking about the case when we want to search a larger amount of data, really fast, where scanning the entire dataset every time is too slow. That's where a different set of techniques come in, starting with indexing.
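(For the curious, here's roughly what that brute-force scan looks like; the documents are made up for illustration.)
In [ ]:
# Brute-force search: scan every document for a pattern on every query.
# Fine for a handful of documents, too slow for a big corpus.
import re

documents = ["the quick brown fox", "email me at fox@example.com"]  # made-up data
pattern = re.compile(r"\bfox\b")
print [i for i, doc in enumerate(documents) if pattern.search(doc)]  # indexes of docs matching "fox"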
Indexing is the process of taking documents, extracting the relevant terms (words, in IR literature) from them, and creating a data structure that can then be used to quickly find the documents again. Here we're going to take a look at indexing a very small set of documents. What is the data structure that is produced? Later, we'll form queries against this data structure to help us find the documents we're looking for.
In [ ]:
# We're using the awesome-tastic search library `whoosh`. Give Matt Chaput a hug if you see him!
import whoosh
In [1]:
# you can run this on the command line with `python index_small_example.py`
%run index_small_example.py
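We won't reproduce the script here, but the core of an indexing script like this one plausibly looks something like the following sketch (the example documents are invented; the schema matches the one we'll see below):
In [ ]:
# A minimal sketch of a small indexing script (example documents invented).
import os
from whoosh.fields import Schema, ID, TEXT
from whoosh.index import create_in

schema = Schema(id=ID(unique=True), content=TEXT)
if not os.path.exists('ex_index'):
    os.mkdir('ex_index')  # create_in needs an existing directory
ix = create_in('ex_index', schema)

writer = ix.writer()
writer.add_document(id=u'1', content=u'Send me an email sometime.')
writer.add_document(id=u'2', content=u'My email address is above.')
writer.commit()  # nothing is searchable until the writer commits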
In [25]:
# OK, so what does the index look like? By default, whoosh creates the index in a directory structure on disk, though it can also
# create temporary indexes in memory. Here's the index we created:
! file ex_index/*
Looks like the indexes that Whoosh is generating for us are a bunch of binary data. _MAIN_1.toc is the TableOfContents file, which basically consists of some versioning info (so the format can change and whoosh can tell), a pickled version of the index's schema, and a list of its segments. For our example, the schema looks like this:
schema = Schema(id=ID(unique=True), content=TEXT)
Pretty simple, huh?
Segments are the way that whoosh breaks up the meat of its index in order to make its operations fast. Here's what the whoosh source has to say about them:
Having multiple segments allows quick incremental indexing: just create a
new segment for the new documents, and have the index overlay the new
segment over previous ones for purposes of reading/search. "Optimizing"
the index combines the contents of existing segments into one (removing
any deleted documents along the way).
(More info about how fast Whoosh is)
But what exactly is stored inside of each segment, and how does that help us search documents?
In [4]:
# OK, let's get set up to take a look at the contents of the index!
from whoosh.index import open_dir
ix = open_dir('ex_index')
print "{} documents in index `ix`".format(ix.doc_count())
r = ix.reader()
In [28]:
# Now we can play around with the contents of the index! Here are some suggestions:
print list(r.all_doc_ids())
print list(r.all_terms())
print list(r.postings('content', 'email').all_ids())
from whoosh import formats
def print_postings(r):
    for fieldname, text, docnum, weight, valuestring in r.iter_postings():
        print fieldname, text, docnum, weight,
        if valuestring is not None:
            print formats.Positions().decode_as('frequency', valuestring),
            print formats.Positions().decode_as('positions', valuestring),
        print
print_postings(r)
# Full API at http://pythonhosted.org/Whoosh/api/reading.html
So ids are the document IDs and terms are the words from the documents; that's pretty self-explanatory. But what's a posting? In the language of IR, a posting really just means a document ID and any other information that's associated with it. And a posting list is a list of postings, mapped to a term in the index. Whoosh includes a few other pieces of information in a posting as well.
But we haven't really talked about indexes yet. You can think of an index the same way you think of the index in the back of a reference book. (They are the same word, after all!) In IR literature, the data structure is actually called an inverted index. So what's inverted about it? Turns out they're considering a mapping from documents->words to be a forward index, and words->documents to be an inverted index. So that's why it's inverted. (More here.)
When creating an inverted index for a set of documents, whoosh does some processing that creates something similar to a book index for those documents. For text, that means breaking the text up by possible search terms, which are usually words, and associating each word with the documents it appears in, as well as some other information, such as the frequency of a given word in the document (more on that later). Whoosh can then take the words someone searches for and, combined with the information in the index, fetch the most relevant documents.
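To make that concrete, here's a toy forward index and inverted index in plain Python. (This is a sketch of the idea, not how whoosh actually lays things out on disk.)
In [ ]:
# A toy inverted index: map each word to (document ID, position) postings.
# Real indexes store this far more compactly, but the idea is the same.
from collections import defaultdict

docs = {1: "email the boston office", 2: "the email arrived"}  # made-up docs

forward = dict((doc_id, text.split()) for doc_id, text in docs.items())  # docs -> words
inverted = defaultdict(list)  # words -> docs
for doc_id, words in forward.items():
    for position, word in enumerate(words):
        inverted[word].append((doc_id, position))  # a tiny "posting"

print sorted(inverted["email"])  # posting list, e.g. [(1, 0), (2, 1)]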
We don't really need to think about postings to use whoosh at a high level, but it's useful to be familiar with the terminology, as it's part of the vocabulary of information retrieval.
OK, now that we know about indexes, we're going to introduce a larger data set so we can search it. This tutorial comes with a set of data generated from a subset of Project Gutenberg. You should have pre-generated metadata and a search index for this data as a part of preparation for this tutorial. If you haven't yet, do it now! It takes a few minutes to run.
In [ ]:
%run extract_metadata.py
%run index_gutentexts.py
In [23]:
from whoosh.qparser import QueryParser
gutenindex = open_dir("gutenindex")
qp = QueryParser("content", schema=gutenindex.schema)
q = qp.parse("women Boston")
with gutenindex.searcher() as s:
    results = s.search(q)
    print results
    print len(results)
You can see that when whoosh parsed the text query "women Boston" into its query representation, it interpreted it by splitting the query into the two individual words and combining those two with an AND boolean operator. [Here's a summary of whoosh's default query language: http://pythonhosted.org/Whoosh/querylang.html]
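You can poke at the parsed representation directly to see this (the exact repr formatting may vary between whoosh versions):
In [ ]:
# Inspect how the parser interprets different query strings.
print repr(qp.parse("women Boston"))       # two terms joined with an implicit AND
print repr(qp.parse("women OR Boston"))    # an explicit OR matches either word
print repr(qp.parse('"women of Boston"'))  # quotes produce a phrase query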
Whoosh returns search results sorted by relevance, most-relevant first. What does it mean by relevance? Sometimes the same word from the query appears multiple times in a given document. Whoosh then weights that document more in the results. (In IR literature, this is called term frequency weighting.)
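You can also swap in a different weighting model when you open a searcher. Here's a sketch using whoosh's plain TF-IDF scorer in place of the default (which, as of this writing, is a BM25F-style scorer):
In [ ]:
# Re-run the same query under a plain TF-IDF weighting model.
from whoosh import scoring

with gutenindex.searcher(weighting=scoring.TF_IDF()) as s:
    print s.search(q)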
Whoosh can also give different weightings to different fields, so e.g. it makes a bigger difference if a book's title matches the query than if the book text does.
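One way to express that is a boost on a field when defining the schema, as in this sketch (the field names are illustrative):
In [ ]:
# Matches in the title field count double toward a document's relevance
# score compared to the same match in the body text.
from whoosh.fields import Schema, ID, TEXT

boosted_schema = Schema(id=ID(unique=True),
                        title=TEXT(field_boost=2.0),
                        content=TEXT)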
Play around with the query language. How can I search for...
Modify the query program to search all fields by default (title/author/content/etc.).
Modify the query program to give much higher weight to titles and authors than to the text of the book. (There's a starting sketch for these two just below.)
Can you modify the query language to support using boolean operators in a different language? This might help.
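For the first two, whoosh's MultifieldParser is a reasonable starting point. (A sketch; the field names assume the gutenindex schema defines title, author, and content fields, so adjust them to match yours.)
In [ ]:
# Parse queries against several fields at once, weighting title and author
# matches much more heavily than matches in the body text. The field names
# here assume the gutenindex schema; adjust them to match yours.
from whoosh.qparser import MultifieldParser

mqp = MultifieldParser(["title", "author", "content"],
                       schema=gutenindex.schema,
                       fieldboosts={"title": 5.0, "author": 5.0, "content": 1.0})
q = mqp.parse("women Boston")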
This is all on you. Here are some additional things to learn if you get this far. If you want some pointers or something explained, raise your hand and we'll help you out.
http://pythonhosted.org/Whoosh/facets.html
Modify query_webapp.py to give misspelling suggestions if no results are found for a given search query. (A starting sketch follows at the end of this list.)
http://pythonhosted.org/Whoosh/keywords.html
Modify query_webapp.py to serve up results with the search terms highlighted.
http://pythonhosted.org/Whoosh/highlight.html
http://pythonhosted.org/Whoosh/ngrams.html
You may want to start with the smaller index example to play around with N-grams, since the full text of books is not exactly a great use case for search-as-you-type.
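As a nudge on the misspelling exercise, whoosh searchers can suggest a corrected query built from the terms actually present in the index (a sketch; the details depend a bit on your whoosh version and schema):
In [ ]:
# A "did you mean" sketch: if a query finds nothing, ask the searcher to
# correct it against the terms in the index.
from whoosh.qparser import QueryParser

qstring = "wimen Bostan"  # a deliberately misspelled query
qp = QueryParser("content", schema=gutenindex.schema)
q = qp.parse(qstring)
with gutenindex.searcher() as s:
    if not s.search(q):
        corrected = s.correct_query(q, qstring)
        if corrected.query != q:
            print "Did you mean: {}?".format(corrected.string)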
In [17]:
print r.frequency('content', 'email')
print list(r.expand_prefix('content', 'e'))
Kragen Sitaker is writing a tiny search engine in pure Python for the next edition of The Architecture of Open Source Applications book, which is a super great read if you're interested in the nitty-gritty of the basic data structures behind indexing:
https://github.com/kragen/500lines/tree/master/search-engine
His "Further information" section contains a bunch of great resources for further study as well.
If you like math, you might find Xapian's introduction to IR theory interesting.
The PageRank paper is a classic modern information retrieval read.
If you end up needing to build fast search capabilities into a really, really large data set, you're going to need a more heavyweight tool than whoosh. ElasticSearch is a popular, well-supported clustering option.